Given two data sets \(\texttt{sales.txt}\) and \(\texttt{item_size.txt}\), our goal is to provide a 52-week demand forecast. We first import, clean, and transform the data to facilitate time-series analysis.
size = read.delim('item_size.txt', header = TRUE, sep = "|", dec = ".")
sales = read.delim('sales.txt',header = TRUE, sep = "|", dec =".")
#Make the transaction_date a date variable
sales$transaction_date = as.Date(sales$transaction_date)
We take an initial look at a sample of each data set, and a histogram of transaction dates binned by week.
head(size)
## item_code item_size
## 1 1012-00001-0189-010-999400191165169227 M REG
## 2 1012-00001-0782-001-999500889587049998 L REG
## 3 1012-00001-0613-060-999700190542543223 XXL REG
## 4 1012-00001-5003-040-999600194112525862 XL REG
## 5 1012-00001-0616-930-9995 L REG
## 6 1012-00001-4616-535-999600191932476701 XL REG
head(sales)
## store item_code transaction_date
## 1 1 1103-00001-0035-248-999300706420416901 2016-01-15
## 2 1 TAX 2016-01-04
## 3 1 TAX 2016-01-03
## 4 1 TAX 2016-01-05
## 5 1 1013-00001-0036-001-999500881862256954 2016-01-05
## 6 1 1012-00001-0063-002-999300881862256190 2016-01-30
## transaction_quantity retail_amount
## 1 1 49.99
## 2 0 0.00
## 3 0 0.00
## 4 0 0.00
## 5 1 39.99
## 6 1 97.19
#Histogram of sales binned by week
hist(sales$transaction_date, "weeks", freq=TRUE, main = "Histogram of Transaction Date", xlab="Transaction Date")
We notice that many of the lines in the sales data involve TAX. These lines contribute no meaningful information, so we remove them.
#Remove the TAX entries
sales = sales[sales$item_code !="TAX",]
Tabulating the transaction quantities lets us look for patterns in the outliers.
as.data.frame(table(sales$transaction_quantity))
## Var1 Freq
## 1 -10 1
## 2 -1 42048
## 3 0 118
## 4 1 751524
## 5 2 294
## 6 3 29
## 7 4 4
## 8 5 3
## 9 6 2
## 10 8 1
## 11 10 2
## 12 11 1
## 13 16 1
## 14 20 1
## 15 22 1
## 16 25 1
## 17 35 1
## 18 36 1
## 19 37 1
We see that, overwhelmingly, each transaction accounts for the purchase of one product, and very few account for more than 4 items in a single transaction. Let’s examine the dates of these large purchases.
sales[sales$transaction_quantity>7,]
## store item_code transaction_date
## 82181 1 6205-00001-0502-001-000000053329460611 2016-04-30
## 87314 1 6205-00001-0502-001-000000053329460611 2016-04-30
## 953267 1 1014-00001-0029-001-999500191165759831 2019-03-30
## 983380 1 1014-00001-0029-001-999400191165759756 2019-03-30
## 993292 1 1104-00001-0009-001-999400191932492169 2019-03-26
## 997037 1 1014-00001-0029-001-999700191165759985 2019-04-09
## 1032512 1 1014-00001-0029-001-999600191165759916 2019-04-09
## 1051676 1 1104-00001-0009-001-999500191932492244 2019-03-27
## 1069625 1 1104-00001-0009-001-999600191932491698 2019-03-29
## 1102735 1 1104-00001-0009-001-999300191932492060 2019-03-25
## 1140528 2 7010-00001-0051-376-000000192826309600 2019-08-14
## transaction_quantity retail_amount
## 82181 10 69.9
## 87314 10 -1100.1
## 953267 22 792.0
## 983380 11 396.0
## 993292 35 1295.0
## 997037 8 288.0
## 1032512 16 576.0
## 1051676 36 1332.0
## 1069625 37 1369.0
## 1102735 25 925.0
## 1140528 20 1602.0
We see that almost all of these major purchases happened between March 25th, 2019 and April 9th, 2019. What was going on during those days? The North Face and Supreme collaboration. We also remove 3 other extreme purchases that appear erroneous.
#Remove Supreme/North Face Colab
sales = sales[!(sales$transaction_quantity>2 & sales$transaction_date <= "2019-04-10" & sales$transaction_date >= "2019-03-20"),]
#Remove sales of more than 6 units (only 3 exist, and they seem erroneous), and returns of more than 1 unit (only one instance).
sales = sales[sales$transaction_quantity<=6,]
sales = sales[!(sales$transaction_quantity <=-2),]
Because we are making a 52-week forecast, we need to collapse all the observations that fall in each week into a single observation. We label each observation with a new column containing the previous (or current) Sunday, using the floor_date function from lubridate.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
dates <- sales$transaction_date
week_start_date = floor_date(sales$transaction_date , "weeks")
sales = cbind(sales, week_start_date)
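As a quick sanity check on this labeling, floor_date with unit "week" maps each date to the preceding (or same) Sunday; the dates below are illustrative, not taken from the sales data.

```r
library(lubridate)

# Three dates in the same retail week: a Friday, the Sunday itself, a Saturday
d <- as.Date(c("2016-01-15", "2016-01-10", "2016-01-16"))
floor_date(d, "week")
# all three map to Sunday 2016-01-10
```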
We want to transform the data into a form useful for time-series analysis: a column for the date and a column for the total transaction quantity. We do this for each of the two stores, and for total sales, resulting in 3 data sets that we write out to both .csv and .txt files. We begin with Store 1, which corresponds to the discount store. These 3 data sets do not include any size data.
#Store 1 Sales Data
store1_weekly_transactions = data.frame(matrix(0, ncol = 1, nrow = length(unique(week_start_date))))
names(store1_weekly_transactions) = c("transaction_quantity")
r = 1
for (j in 1:length(unique(week_start_date)))
{
store1_weekly_transactions[r,1] = sum(sales[sales$week_start_date == unique(week_start_date)[j] & sales$store == 1,]$transaction_quantity)
r = r+1
}
store = c(rep(1,length(unique(week_start_date))))
store1_weekly_transactions = cbind(store, unique(week_start_date), store1_weekly_transactions)
names(store1_weekly_transactions) = c("store","week_start_date", "transaction_quantity")
store1_weekly_transactions = store1_weekly_transactions[order(store1_weekly_transactions$week_start_date),]
write.csv(store1_weekly_transactions,"discount_without_size.csv",row.names = FALSE)
write.table(store1_weekly_transactions, "discount_without_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
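As an aside, the accumulation loop above can be written without an explicit loop; this is a minimal sketch using base R's aggregate on invented toy rows (the column names mirror the sales data, the values are made up).

```r
# Toy stand-in for the sales table (values are made up)
toy <- data.frame(
  store = c(1, 1, 1, 2),
  week_start_date = as.Date(c("2016-01-03", "2016-01-03", "2016-01-10", "2016-01-03")),
  transaction_quantity = c(2, 3, 1, 4)
)
# Sum quantities per week for store 1 in a single call
agg <- aggregate(transaction_quantity ~ week_start_date,
                 data = toy[toy$store == 1, ], FUN = sum)
agg
# week 2016-01-03 -> 5, week 2016-01-10 -> 1
```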
Let’s view a sample of this new data set.
head(store1_weekly_transactions)
## store week_start_date transaction_quantity
## 2 1 2016-01-03 1779
## 1 1 2016-01-10 1504
## 6 1 2016-01-17 1598
## 3 1 2016-01-24 1774
## 4 1 2016-01-31 1540
## 8 1 2016-02-07 2073
We do the same for store 2, which corresponds to the regular store.
#Store 2 sales data
store2_weekly_transactions = data.frame(matrix(0, ncol = 1, nrow = length(unique(week_start_date))))
names(store2_weekly_transactions) = c("transaction_quantity")
r = 1
for (j in 1:length(unique(week_start_date)))
{
store2_weekly_transactions[r,1] = sum(sales[sales$week_start_date == unique(week_start_date)[j] & sales$store == 2,]$transaction_quantity)
r = r+1
}
store = c(rep(2,length(unique(week_start_date))))
store2_weekly_transactions = cbind(store, unique(week_start_date),store2_weekly_transactions)
names(store2_weekly_transactions) = c("store", "week_start_date", "transaction_quantity")
store2_weekly_transactions = store2_weekly_transactions[order(store2_weekly_transactions$week_start_date),]
head(store2_weekly_transactions)
## store week_start_date transaction_quantity
## 2 2 2016-01-03 0
## 1 2 2016-01-10 0
## 6 2 2016-01-17 0
## 3 2 2016-01-24 0
## 4 2 2016-01-31 0
## 8 2 2016-02-07 0
write.csv(store2_weekly_transactions,"regular_without_size.csv",row.names = FALSE)
write.table(store2_weekly_transactions, "regular_without_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
We lastly do a combined analysis where we don’t separate by store, and get total sales across both stores.
#Total sales data
total_weekly_transactions = data.frame(matrix(0, ncol = 1, nrow = length(unique(week_start_date))))
names(total_weekly_transactions) = c("transaction_quantity")
r = 1
for (j in 1:length(unique(week_start_date)))
{
total_weekly_transactions[r,1] = sum(sales[sales$week_start_date == unique(week_start_date)[j],]$transaction_quantity)
r = r+1
}
total_weekly_transactions = cbind(unique(week_start_date), total_weekly_transactions)
names(total_weekly_transactions) = c("week_start_date","transaction_quantity")
total_weekly_transactions = total_weekly_transactions[order(total_weekly_transactions$week_start_date),]
head(total_weekly_transactions)
## week_start_date transaction_quantity
## 2 2016-01-03 1779
## 1 2016-01-10 1504
## 6 2016-01-17 1598
## 3 2016-01-24 1774
## 4 2016-01-31 1540
## 8 2016-02-07 2073
write.csv(total_weekly_transactions,"total_without_size.csv",row.names = FALSE)
write.table(total_weekly_transactions, "total_without_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
We will now create 3 more data sets with a separate column for each size. That is, for each week (a row) we get the total transactions for each size. At this point it is important that we make an observation. The number of unique item_codes in the sales data is:
length(unique(sales$item_code))
## [1] 66984
While the number of unique item_codes in the size data is:
length(unique(size$item_code))
## [1] 16826
We see that there are many more unique items in the sales data than in the size data. This could mean a couple of things: 1) the sales data covers not just men’s coats but every item sold at The North Face store (e.g. shirts, backpacks, hats), including things that simply don’t have a size; or 2) the sales data has coats that are not contained in the size data. Given that the description of the problem indicated that the item_code in the sales data was the “Unique identifier for men’s insulated jackets”, I will assume option 2.
Since many of the coats in the sales data do not have recorded sizes, when we merge \(\texttt{size}\) and \(\texttt{sales}\) we only keep the records that have sizes, thus shrinking our dataset considerably.
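Before merging, the overlap can be quantified directly with %in%; a sketch on toy code vectors (the real check would use sales$item_code and size$item_code).

```r
# Hypothetical item codes standing in for the real columns
sales_codes <- c("A", "B", "C", "TAX", "D")
size_codes  <- c("A", "C")

# How many distinct sales codes have a recorded size?
sum(unique(sales_codes) %in% size_codes)   # 2
# Codes that an inner merge on item_code would drop
setdiff(sales_codes, size_codes)           # "B" "TAX" "D"
```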
For this transformation we run a nested for loop to get all the transactions for each week and for each size. We do this for store 1, store 2, and total sales, producing 3 more data sets.
#Get sales data with size column
sales.plus.size = merge(x=sales, y=size, by="item_code")
#Store 1 + sizes
store1_weekly_transactions.plus.size = data.frame(matrix(0, ncol = length(unique(sales.plus.size$item_size)), nrow = length(unique(week_start_date))))
r = 1
for (j in 1:length(unique(week_start_date)))
{
c=1
for (k in 1:length(unique(sales.plus.size$item_size)))
{
store1_weekly_transactions.plus.size[r,c] = sum(sales.plus.size[sales.plus.size$week_start_date == unique(week_start_date)[j] & sales.plus.size$item_size ==unique(sales.plus.size$item_size)[k] & sales.plus.size$store ==1 ,]$transaction_quantity)
c = c+1
}
r=r+1
}
store1_weekly_transactions.plus.size = cbind(unique(week_start_date), store1_weekly_transactions.plus.size)
names(store1_weekly_transactions.plus.size) = c("week_start_date",sapply(unique(sales.plus.size$item_size),toString))
store1_weekly_transactions.plus.size = store1_weekly_transactions.plus.size[order(store1_weekly_transactions.plus.size$week_start_date),]
head(store1_weekly_transactions.plus.size)
## week_start_date L REG S REG M REG XL REG XXL REG XS REG XXS REG 3XL REG
## 2 2016-01-03 60 29 52 29 8 6 0 0
## 1 2016-01-10 28 23 26 17 10 8 0 0
## 6 2016-01-17 50 20 47 26 10 3 0 0
## 3 2016-01-24 46 13 32 19 9 5 0 0
## 4 2016-01-31 25 8 27 10 6 9 0 0
## 8 2016-02-07 48 21 48 36 16 3 0 0
write.csv(store1_weekly_transactions.plus.size,"discount_with_size.csv",row.names = FALSE)
write.table(store1_weekly_transactions.plus.size, "discount_with_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
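The nested loop can also be expressed as a single two-way cross tabulation with base R's xtabs; a minimal sketch on toy rows (column names follow sales.plus.size, the values are invented).

```r
toy <- data.frame(
  week_start_date = as.Date(c("2016-01-03", "2016-01-03", "2016-01-10")),
  item_size = c("L REG", "M REG", "L REG"),
  transaction_quantity = c(2, 1, 3)
)
# Rows are weeks, columns are sizes, cells are summed quantities
tab <- xtabs(transaction_quantity ~ week_start_date + item_size, data = toy)
tab["2016-01-03", "L REG"]  # 2
tab["2016-01-10", "L REG"]  # 3
```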
Similarly for store 2.
#Store 2 + sizes
store2_weekly_transactions.plus.size = data.frame(matrix(0, ncol = length(unique(sales.plus.size$item_size)), nrow = length(unique(week_start_date))))
r = 1
for (j in 1:length(unique(week_start_date)))
{
c=1
for (k in 1:length(unique(sales.plus.size$item_size)))
{
store2_weekly_transactions.plus.size[r,c] = sum(sales.plus.size[sales.plus.size$week_start_date == unique(week_start_date)[j] & sales.plus.size$item_size ==unique(sales.plus.size$item_size)[k] & sales.plus.size$store ==2,]$transaction_quantity)
c = c+1
}
r=r+1
}
store2_weekly_transactions.plus.size = cbind(unique(week_start_date), store2_weekly_transactions.plus.size)
names(store2_weekly_transactions.plus.size) = c("week_start_date",sapply(unique(sales.plus.size$item_size),toString))
store2_weekly_transactions.plus.size = store2_weekly_transactions.plus.size[order(store2_weekly_transactions.plus.size$week_start_date),]
head(store2_weekly_transactions.plus.size)
## week_start_date L REG S REG M REG XL REG XXL REG XS REG XXS REG 3XL REG
## 2 2016-01-03 0 0 0 0 0 0 0 0
## 1 2016-01-10 0 0 0 0 0 0 0 0
## 6 2016-01-17 0 0 0 0 0 0 0 0
## 3 2016-01-24 0 0 0 0 0 0 0 0
## 4 2016-01-31 0 0 0 0 0 0 0 0
## 8 2016-02-07 0 0 0 0 0 0 0 0
write.csv(store2_weekly_transactions.plus.size,"regular_with_size.csv",row.names = FALSE)
write.table(store2_weekly_transactions.plus.size, "regular_with_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
Lastly, we do the same thing, but for total sales across both stores.
#Total + size
total_weekly_transactions.plus.size = data.frame(matrix(0, ncol = length(unique(sales.plus.size$item_size)), nrow = length(unique(week_start_date))))
r = 1
for (j in 1:length(unique(week_start_date)))
{
c=1
for (k in 1:length(unique(sales.plus.size$item_size)))
{
total_weekly_transactions.plus.size[r,c] = sum(sales.plus.size[sales.plus.size$week_start_date == unique(week_start_date)[j] & sales.plus.size$item_size ==unique(sales.plus.size$item_size)[k],]$transaction_quantity)
c = c+1
}
r=r+1
}
total_weekly_transactions.plus.size = cbind(unique(week_start_date), total_weekly_transactions.plus.size)
names(total_weekly_transactions.plus.size) = c("week_start_date",sapply(unique(sales.plus.size$item_size),toString))
total_weekly_transactions.plus.size = total_weekly_transactions.plus.size[order(total_weekly_transactions.plus.size$week_start_date),]
head(total_weekly_transactions.plus.size)
## week_start_date L REG S REG M REG XL REG XXL REG XS REG XXS REG 3XL REG
## 2 2016-01-03 60 29 52 29 8 6 0 0
## 1 2016-01-10 28 23 26 17 10 8 0 0
## 6 2016-01-17 50 20 47 26 10 3 0 0
## 3 2016-01-24 46 13 32 19 9 5 0 0
## 4 2016-01-31 25 8 27 10 6 9 0 0
## 8 2016-02-07 48 21 48 36 16 3 0 0
write.csv(total_weekly_transactions.plus.size,"total_with_size.csv",row.names = FALSE)
write.table(total_weekly_transactions.plus.size, "total_with_size.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
I will create the following model: a 52-week projection of total sales and the distribution of appropriate sizes for each week. Let’s first demonstrate a forecast for just the discount store, with no consideration of sizing. We will use a seasonal ARIMA model.
#Load in the data and make it timeseries data.
discount_without_size = read.delim('discount_without_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_dis = ts(discount_without_size$transaction_quantity, start=c(2016,1), freq=52)
#remove the last entry since the week is incomplete.
ts_dis.red = ts_dis[1:202]
library(astsa)
##
## Attaching package: 'astsa'
## The following object is masked _by_ '.GlobalEnv':
##
## sales
library(forecast)
##
## Attaching package: 'forecast'
## The following object is masked from 'package:astsa':
##
## gas
We first need to examine what type of differences to take with the data. It is common to take one seasonal difference and, if a positive or negative trend remains, an additional difference between neighboring points, in order to produce a stationary series. Below, we take both types of differences and end up with what appears to be a stationary series.
plot(ts_dis.red, type = "l", main="Transactions From Discount Store", xlab="Weeks Since 2016-01-03", ylab="Transactions")
#take seasonal differences
plot(diff(ts_dis.red,52), type="l", main="Diff(TS,52)", xlab="Weeks Since 2016-01-03", ylab="Transactions")
#Still a trend, so take neighboring differences
plot(diff(diff(ts_dis.red,52)), type="l", main="Diff(Diff(TS,52),1)", xlab="Weeks Since 2016-01-03", ylab="Transactions")
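A one-line illustration of why the second, neighboring difference removes a remaining trend: differencing \(y_t = a + bt\) leaves only the constant slope \(b\). The numbers below are illustrative.

```r
# A noiseless linear trend y_t = 3 + 2t; its first differences are all 2
y <- 3 + 2 * (1:10)
diff(y)
# every element equals the slope, 2
```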
Next, we analyze the autocorrelations and partial autocorrelations to determine the orders for the AR, MA, and seasonal AR and MA parts. A rule of thumb for choosing the orders: early spikes in the ACF correspond to the order of the MA part, and early spikes in the PACF correspond to the order of the AR part. Spikes at a lag corresponding to the length of the season (52 in our case) similarly give the orders of the seasonal AR and MA parts.
b = acf2(diff(diff(ts_dis.red,52)), 100)
In our situation we see two early spikes in the ACF and 2 early spikes in the PACF, so our non-seasonal parameters can be (2,1,2). We see one spike around lag 52 in both, so our seasonal parameters can be (1,1,1). Running this model we get a 52-week forecast and a plot with 80% and 95% CIs.
model = Arima(ts_dis.red, order=c(2,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS")
summary(model)
## Series: ts_dis.red
## ARIMA(2,1,2)(1,1,1)[52]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## -0.6055 0.2713 0.1270 -0.9941 -0.0153 -0.6906
## s.e. 0.0274 0.0092 0.0299 0.0338 0.0759 0.1459
##
## sigma^2 estimated as 177984: part log likelihood=-1142.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -27.36349 354.963 182.8613 -2.127001 5.776429 0.1916225
## ACF1
## Training set 0.0008177134
forecast(model,h=52)
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 203 11455.273 10862.335 12048.211 10548.4524 12362.093
## 204 4021.704 3383.073 4660.335 3045.0029 4998.405
## 205 5088.672 4443.538 5733.807 4102.0242 6075.321
## 206 6618.490 5968.022 7268.957 5623.6854 7613.294
## 207 9527.449 8875.185 10179.712 8529.8978 10525.000
## 208 6828.120 6173.166 7483.075 5826.4535 7829.787
## 209 3472.566 2823.187 4121.945 2479.4265 4465.706
## 210 2039.902 1387.461 2692.343 1042.0796 3037.725
## 211 1906.504 1252.370 2560.638 906.0920 2906.915
## 212 2040.750 1383.729 2697.770 1035.9238 3045.576
## 213 1872.799 1214.000 2531.599 865.2527 2880.346
## 214 1987.719 1326.168 2649.270 975.9646 2999.474
## 215 2726.064 2062.663 3389.464 1711.4799 3740.648
## 216 3039.414 2373.372 3705.456 2020.7904 4058.037
## 217 2520.037 1852.089 3187.986 1498.4979 3541.577
## 218 2600.226 1929.726 3270.725 1574.7856 3625.666
## 219 2867.705 2195.254 3540.155 1839.2802 3896.129
## 220 2671.108 1996.182 3346.033 1638.8984 3703.317
## 221 2422.761 1745.850 3099.672 1387.5149 3458.008
## 222 2090.052 1410.730 2769.375 1051.1179 3128.986
## 223 1950.339 1269.005 2631.674 908.3286 2992.350
## 224 2332.948 1649.256 3016.640 1287.3315 3378.565
## 225 2453.653 1767.930 3139.375 1404.9304 3502.375
## 226 2362.214 1674.179 3050.248 1309.9554 3414.472
## 227 2425.184 1735.105 3115.262 1369.7997 3480.568
## 228 2924.780 2232.429 3617.131 1865.9203 3983.640
## 229 3934.165 3239.761 4628.568 2872.1664 4996.163
## 230 3892.398 3195.756 4589.040 2826.9762 4957.820
## 231 2404.267 1705.568 3102.966 1335.6996 3472.835
## 232 2898.028 2197.121 3598.936 1826.0825 3969.974
## 233 3069.887 2366.920 3772.853 1994.7926 4144.980
## 234 3971.444 3266.295 4676.592 2893.0119 5049.875
## 235 4583.853 3876.647 5291.059 3502.2753 5665.431
## 236 2721.750 2012.385 3431.115 1636.8703 3806.631
## 237 2822.513 2111.094 3533.932 1734.4915 3910.535
## 238 3657.248 2943.691 4370.805 2565.9565 4748.540
## 239 4454.504 3738.897 5170.110 3360.0782 5548.929
## 240 4582.695 3864.969 5300.420 3485.0280 5680.361
## 241 4170.466 3450.698 4890.235 3069.6757 5271.257
## 242 3926.459 3204.588 4648.329 2822.4532 5030.465
## 243 4607.876 3883.970 5331.782 3500.7579 5714.994
## 244 4540.939 3814.947 5266.931 3430.6295 5651.248
## 245 2459.960 1731.941 3187.979 1346.5513 3573.369
## 246 3293.390 2563.299 4023.481 2176.8122 4409.968
## 247 4786.692 4054.584 5518.800 3667.0293 5906.355
## 248 4924.382 4190.215 5658.549 3801.5698 6047.194
## 249 4839.099 4102.924 5575.273 3713.2173 5964.980
## 250 4126.613 3388.392 4864.835 2997.6012 5255.626
## 251 3775.008 3034.790 4515.225 2642.9431 4907.072
## 252 3570.063 2827.810 4312.317 2434.8846 4705.242
## 253 3966.010 3221.773 4710.248 2827.7971 5104.223
## 254 2930.625 2184.361 3676.889 1789.3124 4071.938
plot(forecast(model,h=52), main = "Forecast for Discount Store")
Due to the size of the output from the forecast function, I will hide its output for the remainder. Another way to pick the parameters is the auto.arima function, which runs a grid search over the parameters and picks the model with the best information criterion (AICc by default). This can sometimes produce results that might seem suspicious, e.g. no seasonal part.
#Run auto.arima's automatic model search
auto.arima(ts_dis)
## Series: ts_dis
## ARIMA(0,1,2)(1,1,0)[52]
##
## Coefficients:
## ma1 ma2 sar1
## -0.5443 -0.3921 -0.2776
## s.e. 0.0827 0.0819 0.1011
##
## sigma^2 estimated as 397528: log likelihood=-1173.52
## AIC=2355.04 AICc=2355.32 BIC=2367.06
We see the auto.arima function gives the parameters (0,1,2)(1,1,0). We can run this model as well and see a plot.
#model choice given by auto.arima.
model2 = Arima(ts_dis.red, order=c(0,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for Discount Store - Auto.Model")
Now that we’ve seen the process for how to build a SARIMA model we can begin to build our final demand forecast.
We proceed just as before.
#Load in the data and make it timeseries data.
total_without_size = read.delim('total_without_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_tot = ts(total_without_size$transaction_quantity, start=c(2016,1), freq=52)
#remove the last entry since the week is incomplete.
ts_tot.red = ts_tot[1:202]
plot(ts_tot.red, type = "l", main="Total Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")
#take seasonal differences
plot(diff(ts_tot.red,52), type="l", main="Diff(TS,52)", xlab="Weeks Since 2016-01-03", ylab="Transactions")
#Still a trend, so take neighboring differences
plot(diff(diff(ts_tot.red,52)), type="l", main="Diff(Diff(TS,52),1)", xlab="Weeks Since 2016-01-03", ylab="Transactions")
b = acf2(diff(diff(ts_tot.red,52)), 100)
Based on the ACF and PACF plots, the model parameters should be (2,1,2)(1,1,1).
model = Arima(ts_tot.red, order=c(2,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS")
summary(model)
## Series: ts_tot.red
## ARIMA(2,1,2)(1,1,1)[52]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## -0.6055 0.2713 0.1270 -0.9941 -0.0153 -0.6906
## s.e. 0.0274 0.0092 0.0299 0.0338 0.0759 0.1459
##
## sigma^2 estimated as 177984: part log likelihood=-1142.55
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -27.36349 354.963 182.8613 -2.127001 5.776429 0.1916225
## ACF1
## Training set 0.0008177134
plot(forecast(model,h=52), main = "Forecast for Total Sales - My Model")
Using the auto.arima function gives:
#Run auto.arima's automatic model search
auto.arima(ts_tot)
## Series: ts_tot
## ARIMA(0,1,2)(1,1,0)[52]
##
## Coefficients:
## ma1 ma2 sar1
## -0.5443 -0.3921 -0.2776
## s.e. 0.0827 0.0819 0.1011
##
## sigma^2 estimated as 397528: log likelihood=-1173.52
## AIC=2355.04 AICc=2355.32 BIC=2367.06
which gives the parameters (0,1,2)(1,1,0).
#model choice given by auto.arima.
model2 = Arima(ts_tot.red, order=c(0,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for Total Sales - Auto.Model")
To choose between the two models, we compare against previous data: we look at how well each model predicts the last 52 weeks of the training set. The lines in red are the known training data, and the blue represents our prediction.
#Load in the data and make it timeseries data.
total_without_size = read.delim('total_without_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_tot = ts(total_without_size$transaction_quantity, start=c(2016,1), freq=52)
#hold out the last 52 weeks (151-202) as a test set
ts_tot.test = ts_tot[1:150]
model = Arima(ts_tot.test, order=c(2,1,2), seasonal=list(order=c(0,1,1), period=52), method="CSS")
summary(model)
## Series: ts_tot.test
## ARIMA(2,1,2)(0,1,1)[52]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sma1
## -0.5541 0.1413 0.0034 -0.9509 -0.3154
## s.e. 0.1008 0.1037 0.0352 0.0351 0.1669
##
## sigma^2 estimated as 389319: part log likelihood=-760.38
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 46.70415 488.6538 277.491 0.6620145 8.411441 0.3133448
## ACF1
## Training set -0.01298397
plot(forecast(model,h=52), main = "Forecast for Total Sales TEST - My Model")
lines(ts_tot[1:201], type="l", col="red")
#model choice given by auto.arima.
model2 = Arima(ts_tot.test, order=c(0,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for Total Sales TEST - Auto.Model")
lines(ts_tot[1:201], type="l", col="red")
myforecast = data.frame(forecast(model,52))
autoforecast = data.frame(forecast(model2,52))
sum(abs(myforecast$Point.Forecast-ts_tot[151:202]))
## [1] 36953.72
sum(abs(autoforecast$Point.Forecast-ts_tot[151:202]))
## [1] 30710.75
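The comparison metric used above is just the sum of absolute errors between point forecasts and held-out actuals; as a tiny self-contained example with invented numbers:

```r
# Sum of absolute errors between point forecasts and actuals
sae <- function(pred, actual) sum(abs(pred - actual))

actual <- c(100, 120, 90)
sae(c(110, 115, 95), actual)  # model A: 10 + 5 + 5 = 20
sae(c(105, 118, 92), actual)  # model B:  5 + 2 + 2 = 9, so B is preferred
```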
Since the model chosen by auto.arima has a smaller error, we choose it as our 52-week demand forecast.
demand_forecast = data.frame(forecast(model2,52))
We now incorporate the size data into our model. My plan is the following: create a 52-week demand forecast for each of the 8 sizes. Each size then represents a proportion of the total predicted sized transactions. We multiply this proportion by our total-sales prediction to finally get a predicted number for each size, for each week.
All the code for each of these models is built using the exact same method shown above.
total_with_size = read.delim('total_with_size.txt', header = TRUE, sep = "\t", dec = ".")
ts_large = ts(total_with_size$L.REG, start=c(2016,1), freq=52)
ts_large.red = ts_large[1:202]
b = acf2(diff(diff(ts_large.red,52)), 100)
model = Arima(ts_large.red, order=c(2,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS")
summary(model)
## Series: ts_large.red
## ARIMA(2,1,2)(1,1,1)[52]
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sma1
## 0.2627 0.1851 -0.5755 -0.557 -0.0218 -0.6540
## s.e. 0.0333 0.0882 NaN NaN 0.0096 0.1719
##
## sigma^2 estimated as 291.7: part log likelihood=-664.73
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.1470356 14.36995 7.360439 -10.94277 22.06488 0.3711611
## ACF1
## Training set -0.0007727724
plot(forecast(model,h=52), main = "Forecast for Large Coats - My Model")
auto.arima(ts_large)
## Series: ts_large
## ARIMA(0,1,0)(1,1,0)[52]
##
## Coefficients:
## sar1
## -0.1751
## s.e. 0.1139
##
## sigma^2 estimated as 652.1: log likelihood=-694.51
## AIC=1393.01 AICc=1393.09 BIC=1399.02
model2 = Arima(ts_large.red, order=c(2,1,1), seasonal=list(order=c(0,1,1), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for Large Coats - Auto.Model")
#We choose my model.
forecast_large = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast,forecast_large$Point.Forecast)
ts_med = ts(total_with_size$M.REG, start=c(2016,1), freq=52)
ts_med.red = ts_med[1:202]
b = acf2(diff(diff(ts_med.red,52)), 100)
model = Arima(ts_med.red, order=c(3,1,2), seasonal=list(order=c(1,1,1), period=52), method="CSS")
summary(model)
## Series: ts_med.red
## ARIMA(3,1,2)(1,1,1)[52]
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 sar1 sma1
## -0.8698 -0.9328 -0.4695 0.8737 0.6678 -0.5969 0.4423
## s.e. 0.1196 0.0864 0.0883 0.1171 0.0922 0.0474 0.2175
##
## sigma^2 estimated as 236.7: part log likelihood=-649.43
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1764305 12.89933 6.185121 -7.227211 26.07649 0.312757
## ACF1
## Training set -0.01024261
plot(forecast(model,h=52), main = "Forecast for Medium Coats - My Model")
auto.arima(ts_med)
## Series: ts_med
## ARIMA(2,1,4)(0,1,0)[52]
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 ma4
## -0.1889 0.6507 -0.3337 -0.9703 0.2565 0.0976
## s.e. 0.4312 0.3379 0.4411 0.4939 0.1991 0.1252
##
## sigma^2 estimated as 862.9: log likelihood=-712.67
## AIC=1439.34 AICc=1440.14 BIC=1460.37
model2 = Arima(ts_med.red, order=c(2,1,2), seasonal=list(order=c(1,1,0), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for Medium Coats - Auto.Model")
forecast_medium = data.frame(forecast(model2,h=52))
demand_forecast = cbind(demand_forecast, forecast_medium$Point.Forecast)
ts_small = ts(total_with_size$S.REG, start=c(2016,1), freq=52)
ts_small.red = ts_small[1:202]
b = acf2(diff(diff(ts_small.red,52)), 100)
model = Arima(ts_small.red, order=c(3,1,1), seasonal=list(order=c(0,1,1), period=52), method="CSS")
summary(model)
## Series: ts_small.red
## ARIMA(3,1,1)(0,1,1)[52]
##
## Coefficients:
## ar1 ar2 ar3 ma1 sma1
## 0.3931 0.0432 0.1443 -0.9762 -0.1668
## s.e. 0.0909 0.0932 0.0907 0.0225 0.1051
##
## sigma^2 estimated as 225.8: part log likelihood=-614.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 1.100382 12.68656 6.43006 Inf Inf 0.5703628 -0.01032663
plot(forecast(model,h=52), main = "Forecast for Small Coats - My Model")
auto.arima(ts_small)
## Series: ts_small
## ARIMA(1,0,3) with non-zero mean
##
## Coefficients:
## ar1 ma1 ma2 ma3 mean
## 0.8247 -0.4586 -0.0703 0.1673 24.0780
## s.e. 0.0645 0.0878 0.0713 0.0815 4.8919
##
## sigma^2 estimated as 390.5: log likelihood=-887.12
## AIC=1786.25 AICc=1786.68 BIC=1806.1
model2 = Arima(ts_small.red, order=c(1,0,1), seasonal=list(order=c(0,0,1), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for Small Coats - Auto.Model")
forecast_small = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast, forecast_small$Point.Forecast)
ts_XXL = ts(total_with_size$XXL.REG, start=c(2016,1), freq=52)
ts_XXL.red = ts_XXL[1:202]
b = acf2(diff(diff(ts_XXL.red,52)), 100)
model = Arima(ts_XXL.red, order=c(2,1,1), seasonal=list(order=c(0,1,0), period=52), method="CSS")
summary(model)
## Series: ts_XXL.red
## ARIMA(2,1,1)(0,1,0)[52]
##
## Coefficients:
## ar1 ar2 ma1
## 0.0935 -0.0370 -0.8163
## s.e. 0.1222 0.1093 0.0904
##
## sigma^2 estimated as 36.17: part log likelihood=-478.23
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.2636462 5.112776 3.253087 -Inf Inf 0.7084188 -0.002819882
plot(forecast(model,h=52), main = "Forecast for XXL Coats - My Model")
auto.arima(ts_XXL)
## Series: ts_XXL
## ARIMA(1,0,1)(0,0,1)[52] with non-zero mean
##
## Coefficients:
## ar1 ma1 sma1 mean
## 0.8531 -0.5162 0.4703 9.0614
## s.e. 0.0594 0.0966 0.1007 1.6863
##
## sigma^2 estimated as 29.95: log likelihood=-634.67
## AIC=1279.35 AICc=1279.65 BIC=1295.89
model2 = Arima(ts_XXL.red, order=c(1,0,1), seasonal=list(order=c(0,1,1), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for XXL Coats - Auto.Model")
forecast_XXL = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast, forecast_XXL$Point.Forecast)
ts_XL = ts(total_with_size$XL.REG, start=c(2016,1), freq=52)
ts_XL.red = ts_XL[1:202]
b = acf2(diff(diff(ts_XL.red,52)), 100)
model = Arima(ts_XL.red, order=c(2,1,1), seasonal=list(order=c(0,1,0), period=52), method="CSS")
summary(model)
plot(forecast(model,h=52), main = "Forecast for XL Coats - My Model")
auto.arima(ts_XL)
model2 = Arima(ts_XL.red, order=c(1,0,1), seasonal=list(order=c(0,1,1), period=52), method="CSS")
plot(forecast(model2,h=52), main = "Forecast for XL Coats - Auto.Model")
forecast_XL = data.frame(forecast(model,h=52))
demand_forecast = cbind(demand_forecast, forecast_XL$Point.Forecast)
The last three sizes, 3XL, XS, and XXS, are bought so infrequently that a time-series model, and especially a seasonal ARIMA model, is a poor fit. So few of these coats are purchased, and at such irregular times, that it makes more sense to forecast a constant weekly rate and supply them periodically as necessary.

### 3XL
ts_3XL = ts(total_with_size$X3XL.REG, start=c(2016,1), freq=52)
ts_3XL.red = ts_3XL[1:202]
plot(ts_3XL.red, type = "l", main = "3XL Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")
sum(total_with_size$X3XL.REG)/202
## [1] 0.4059406
forecast_3XL = c(rep(.4,52))
forecast_3XL = data.frame(forecast_3XL)
demand_forecast = cbind(demand_forecast,forecast_3XL)
ts_XS = ts(total_with_size$XS.REG, start=c(2016,1), freq=52)
ts_XS.red = ts_XS[1:202]
plot(ts_XS.red, type = "l", main = "XS Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")
sum(total_with_size$XS.REG)/202
## [1] 0.8267327
forecast_XS = c(rep(.827,52))
forecast_XS = data.frame(forecast_XS)
demand_forecast = cbind(demand_forecast, forecast_XS)
ts_XXS = ts(total_with_size$XXS.REG, start=c(2016,1), freq=52)
ts_XXS.red = ts_XXS[1:202]
plot(ts_XXS.red, type = "l", main = "XXS Transactions", xlab="Weeks Since 2016-01-03", ylab="Transactions")
sum(ts_XXS.red)/202
## [1] 0.02475248
forecast_XXS = c(rep(0.024,52))
forecast_XXS = data.frame(forecast_XXS)
demand_forecast = cbind(demand_forecast, forecast_XXS)
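The same constant-rate forecast is built by hand three times above (3XL, XS, XXS). As a minimal sketch, a small helper (the name `constant_forecast` is ours, not part of the original analysis) removes the repetition:

```r
# Hypothetical helper: a flat n_weeks-long forecast at the observed mean
# weekly rate, rounded to three decimals as in the manual blocks above
constant_forecast = function(weekly_counts, n_weeks = 52) {
  data.frame(forecast = rep(round(mean(weekly_counts), 3), n_weeks))
}
```

With this, e.g. `forecast_XS = constant_forecast(ts_XS.red)` reproduces the XS block up to rounding.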
We now use these forecasts to compute the proportion of total sales allocated to each size. To do this we run a nested for loop; the heart of the matter is the expression inside the loops. We make sure the columns are named appropriately and round each prediction to the nearest integer. We write our prediction to a CSV and a TXT file, and also print it here.
names(demand_forecast) = c("Total_Sales","Low 80","High 80","Low 95","High 95", "L","M","S","XL","XXL","3XL", "XS", "XXS")
#We now have a demand forecast, but we need a proportional estimate for sizes.
size_proportions = data.frame(matrix(0, ncol = 8, nrow = 52))
for (i in 1:52)
{
for (j in 1:8)
{
size_proportions[i,j] = (demand_forecast[i,5+j]/sum(demand_forecast[i,6:13]))*demand_forecast[i,1]
}
}
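The nested loop can also be written as one vectorized expression. A sketch, wrapped in a helper function of our own naming, assuming as above that column 1 of the forecast table is the total and the size columns follow:

```r
# Vectorized equivalent of the nested loop: divide each size column by its
# row sum (giving per-row size proportions), then scale by the row's total
allocate_sizes = function(fc, size_cols = 6:13) {
  m = as.matrix(fc[, size_cols])
  data.frame(m / rowSums(m) * fc[, 1])
}
```

With it, `size_proportions = allocate_sizes(demand_forecast)` replaces the double loop and avoids recomputing the row sum on every inner iteration.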
demand_forecast = cbind(demand_forecast$Total_Sales, demand_forecast$`Low 80`, demand_forecast$`High 80`, demand_forecast$`Low 95`, demand_forecast$`High 95`, size_proportions)
names(demand_forecast) = c("Point Forecast", "Low 80", "High 80", "Low 95", "High 95", "L","M","S","XL","XXL","3XL","XS","XXS")
demand_forecast = round(demand_forecast)
start = as.Date("2019-11-11")
dates = c(start)
for (i in 1:51)
{
dates = c(dates,start+i*7)
}
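The date-building loop is equivalent to a single call to base R's `seq` method for `Date` objects:

```r
# One-liner equivalent of the loop above: 52 consecutive week-start dates
dates = seq(as.Date("2019-11-11"), by = "1 week", length.out = 52)
```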
dates = data.frame(dates)
demand_forecast = cbind(dates,demand_forecast)
names(demand_forecast) = c("Week_Start_Date","Point_Forecast", "Low_80", "High_80", "Low_95", "High_95", "L","M","S","XL","XXL","3XL","XS","XXS")
write.csv(demand_forecast,"demand_forecast.csv",row.names = FALSE)
write.table(demand_forecast, "demand_forecast.txt", append = FALSE, sep = "\t", dec = ".",row.names = FALSE, col.names = TRUE)
demand_forecast
## Week_Start_Date Point_Forecast Low_80 High_80 Low_95 High_95 L M
## 1 2019-11-11 11636 11151 12122 10894 12379 3356 3889
## 2 2019-11-18 4380 3864 4896 3591 5168 1039 1481
## 3 2019-11-25 5774 5226 6321 4936 6611 1641 2113
## 4 2019-12-02 7007 6428 7585 6122 7891 2416 2200
## 5 2019-12-09 9716 9109 10324 8788 10645 3278 3334
## 6 2019-12-16 6560 5925 7195 5590 7531 1804 2352
## 7 2019-12-23 3760 3099 4421 2749 4771 996 1311
## 8 2019-12-30 2471 1785 3158 1421 3521 668 997
## 9 2020-01-06 2257 1546 2968 1169 3344 658 700
## 10 2020-01-13 2135 1400 2869 1011 3258 575 791
## 11 2020-01-20 2348 1590 3105 1189 3506 658 944
## 12 2020-01-27 1926 1147 2706 734 3119 579 601
## 13 2020-02-03 3254 2452 4055 2028 4480 1113 1096
## 14 2020-02-10 3754 2932 4577 2496 5012 1061 1141
## 15 2020-02-17 2988 2145 3831 1699 4278 871 983
## 16 2020-02-24 2847 1984 3710 1527 4167 867 879
## 17 2020-03-02 3182 2299 4065 1832 4532 1046 1009
## 18 2020-03-09 2727 1825 3629 1347 4106 898 901
## 19 2020-03-16 2667 1746 3588 1259 4075 779 978
## 20 2020-03-23 2283 1344 3222 846 3719 620 710
## 21 2020-03-30 2385 1428 3343 921 3849 769 806
## 22 2020-04-06 2607 1632 3582 1116 4098 768 921
## 23 2020-04-13 2836 1844 3828 1318 4354 1052 928
## 24 2020-04-20 2668 1659 3678 1124 4212 975 912
## 25 2020-04-27 2791 1765 3818 1222 4361 780 943
## 26 2020-05-04 3023 1980 4066 1428 4618 1034 1132
## 27 2020-05-11 3676 2617 4735 2056 5296 1160 1162
## 28 2020-05-18 3994 2919 5069 2350 5638 1086 1314
## 29 2020-05-25 2435 1344 3526 766 4103 789 856
## 30 2020-06-01 3134 2028 4241 1442 4827 1116 998
## 31 2020-06-08 3375 2253 4497 1659 5091 927 1149
## 32 2020-06-15 4040 2903 5177 2301 5779 1173 1258
## 33 2020-06-22 5043 3891 6195 3281 6805 1430 1879
## 34 2020-06-29 3227 2060 4394 1443 5011 900 1157
## 35 2020-07-06 3301 2120 4482 1494 5108 921 1206
## 36 2020-07-13 3994 2798 5190 2165 5823 1459 1457
## 37 2020-07-20 4695 3485 5905 2845 6546 1187 1531
## 38 2020-07-27 4800 3576 6024 2928 6672 1431 1682
## 39 2020-08-03 4454 3216 5692 2561 6348 1304 1357
## 40 2020-08-10 4216 2964 5467 2301 6130 1331 1538
## 41 2020-08-17 4405 3139 5670 2470 6340 1316 1360
## 42 2020-08-24 4523 3245 5802 2568 6479 1321 1466
## 43 2020-08-31 2344 1052 3637 368 4320 738 843
## 44 2020-09-07 3458 2153 4764 1462 5454 1044 1054
## 45 2020-09-14 4746 3428 6065 2730 6762 1434 1417
## 46 2020-09-21 4395 3064 5726 2359 6431 1252 1354
## 47 2020-09-28 5110 3766 6454 3054 7165 1349 1643
## 48 2020-10-05 4567 3211 5924 2492 6642 1417 1380
## 49 2020-10-12 3975 2606 5344 1881 6069 1244 1339
## 50 2020-10-19 3953 2572 5335 1840 6066 1103 1368
## 51 2020-10-26 4163 2769 5557 2031 6295 1208 1406
## 52 2020-11-02 3563 2157 4970 1413 5714 1245 1272
## S XL XXL 3XL XS XXS
## 1 2178 1096 1096 7 14 0
## 2 907 466 466 6 13 0
## 3 1082 458 458 7 14 0
## 4 1592 388 388 7 15 0
## 5 1744 670 670 7 14 0
## 6 1568 411 411 5 10 0
## 7 962 238 238 5 10 0
## 8 511 136 136 7 15 0
## 9 538 172 172 5 11 0
## 10 503 125 125 5 10 0
## 11 366 182 182 5 11 0
## 12 449 141 141 5 10 0
## 13 603 211 211 6 13 0
## 14 772 380 380 6 13 0
## 15 625 244 244 6 13 0
## 16 610 237 237 6 12 0
## 17 408 350 350 6 12 0
## 18 423 244 244 6 11 0
## 19 540 175 175 7 14 0
## 20 424 255 255 6 11 0
## 21 430 179 179 7 15 0
## 22 490 202 202 8 16 0
## 23 473 178 178 8 17 1
## 24 450 153 153 8 17 0
## 25 611 214 214 9 19 1
## 26 417 203 203 11 22 1
## 27 652 337 337 9 18 1
## 28 832 361 361 13 26 1
## 29 492 127 127 14 30 1
## 30 608 182 182 16 33 1
## 31 633 313 313 13 27 1
## 32 689 435 435 16 34 1
## 33 873 401 401 19 39 1
## 34 660 233 233 14 29 1
## 35 535 298 298 14 29 1
## 36 622 201 201 18 36 1
## 37 972 477 477 17 34 1
## 38 1018 305 305 19 39 1
## 39 912 418 418 14 30 1
## 40 771 268 268 13 26 1
## 41 830 434 434 10 20 1
## 42 1099 304 304 10 20 1
## 43 469 134 134 8 17 0
## 44 635 348 348 9 19 1
## 45 1303 284 284 8 16 0
## 46 977 395 395 7 14 0
## 47 1124 488 488 6 12 0
## 48 886 431 431 7 15 0
## 49 677 349 349 6 12 0
## 50 805 331 331 5 11 0
## 51 926 302 302 6 12 0
## 52 552 236 236 7 14 0